Abstract

Automatically classifying the tissue types of Regions of Interest (ROIs) in
medical imaging is an important application in Computer-Aided Diagnosis (CAD),
for example the classification of breast parenchymal tissue in mammograms or of
lung disease patterns in High-Resolution Computed Tomography (HRCT). Recently,
the bag-of-features method has shown its power in this field by treating each
ROI as a set of local features. In this paper, we investigate the
bag-of-features strategy for classifying tissue types in medical imaging
applications. Two important issues are considered: visual vocabulary learning
and visual word weighting. Although many algorithms address these issues, all
of them treat the two independently: the vocabulary is learned first and the
histogram is weighted afterwards. Inspired by Auto-Context, which learns the
features and the classifier jointly, we develop a novel algorithm that learns
the vocabulary and the weights jointly. The new algorithm, called Joint-ViVo,
works iteratively. In each iteration, we first learn the weight of each visual
word by maximizing the margin of ROI triplets, and then select the most
discriminative visual words based on the learned weights for the next
iteration. We test our algorithm on three tissue classification tasks:
identifying brain tissue type in magnetic resonance imaging (MRI), classifying
lung tissue in HRCT images, and classifying breast tissue density in
mammograms. The results show that Joint-ViVo classifies tissues effectively.

Keywords: Computer-Aided Diagnosis, Tissue Classification, Bag-of-Features,
Visual Vocabulary, Visual Word Weighting

*Corresponding author. Tel: 966-544701866
Email address: jingyan.wang@kaust.edu.sa (Jingyan Wang)

Preprint submitted to Elsevier
August 21, 2012



1. Introduction 

Automated Computer-Aided Diagnosis (CAD) systems play an important role in
modern medical practice [1, 2, 3]. Accurate classification of medical images
according to tissue type at the region of interest (ROI) level is important in
many CAD applications [4]. A typical application is the diagnosis of breast
cancer using mammography as the imaging technology. From a medical point of
view, it is well known that there is a strong positive correlation between high
breast parenchymal density and high breast cancer risk. Automatic methods for
classifying breast parenchymal tissue in mammograms are therefore a natural
component of an automatic risk assessment framework in prospective CAD systems.
Several techniques have been proposed for breast density classification using
mammograms [7]. Another typical application is the diagnosis of diffuse lung
diseases (DLDs), a heterogeneous group of diseases that affect the lung
parenchyma in various ways. High-resolution computed tomography (HRCT) [9] is
useful for characterizing DLDs because it provides better delineation of small
structures and details within the lung. However, interpreting HRCT images is
difficult even for specialists because of the complexity and variation of
diffuse disease patterns. Therefore, a CAD system that classifies lung disease
patterns is required. In recent years, many automated techniques have been
proposed to classify diffuse lung disease patterns into several classes such as
ground-glass opacities, reticular and linear opacities, honeycombing,
emphysematous change, and so on [10].

Most tissue classification techniques extract features from medical images
with texture analysis approaches such as the gray level co-occurrence matrix
(GLCM) [11], which measure spatial dependencies of intensity values within a
region of interest (ROI) as second- and higher-order statistics. Although they
offer significant discriminatory power between various tissue patterns, these
approaches do not work well for patterns with an inhomogeneous texture
distribution within a ROI, such as reticular and honeycombing patterns, because
the statistics can only capture features averaged over the ROI. To overcome
this difficulty, the bag-of-features model, which has recently proven effective
for image retrieval tasks [12, 13, 14, 15], was introduced for tissue
classification of ROIs in medical images. Barnathan et al. [16] proposed a
methodology for discriminating between various types of normal and diseased
brain tissue in medical images that uses bag-of-features to extract
discriminative texture features. Rather than focusing on images of the entire
brain, they extracted local descriptors for individual ROIs determined by
domain experts, and represented each ROI as a histogram of codebook
frequencies. Kato et al. [10] proposed a bag-of-features approach to improve
lung tissue classification in diffuse lung disease. In their model, images are
represented as histograms or distributions of several types of local features
that are obtained from training samples automatically. Bosch et al. [7]
presented a bag-of-features based approach to model and classify breast
parenchymal tissue in mammograms, using a classifier based on local descriptors
and probabilistic Latent Semantic Analysis (pLSA) [17], which is a generative
model from the statistical text literature.

As argued by Cai et al. [18], the Visual Vocabulary (ViVo), or codebook, plays
the key role in the bag-of-features model; it is a collection of
vector-quantized features. The most popular way of creating a visual vocabulary
is k-means clustering [19], [20] or one of its variants, e.g., hierarchical
k-means. However, it has been argued that k-means does not select the most
informative descriptors, as it tends to concentrate the cluster centers in
high-density areas of the feature space and starves lower-density ones [21],
[22]. To deal with this problem, two kinds of strategies have been proposed:

• Learning a discriminative visual vocabulary: Radius-based clustering has
been used for visual vocabulary generation. In [23], supervised learning of
quantizer codebooks by information loss minimization is proposed.

• Weighting visual words: In [18], Cai et al. presented a visual word
weighting strategy that learns a weighted similarity metric such that the
weighted similarity between same-labeled images is larger than that between
differently labeled images by the largest margin. In [24], Wang et al.
proposed a visual word weighting method that analyzes the discriminative power
of each visual word through the sub-similarity function in the bin that
corresponds to the visual word.

Up to now, these two strategies have always been used independently: first the
visual vocabulary is learned, and then the weighting factors are estimated for
each visual word. In this setting, employing a discriminative visual vocabulary
generation method without taking the visual word weighting into account is
suboptimal. On the one hand, the bag-level feature is the histogram of the
local features, which is directly determined by the visual vocabulary. On the
other hand, the visual word weighting in [18], [24] is learned under the
supervision of the labels, based on the bag-level features of the ROIs, and a
classifier is then trained on the weighted bag-of-features. Evidently, the
weighting of the bag-level feature is closely related to the construction of
the bag-level feature itself, which is in turn determined by the visual
vocabulary. A better way is to learn the visual vocabulary and the weighting
factors jointly. Nonetheless, the problem of deriving a joint procedure that
combines clustering and weighting appears to have received no attention.

Inspired by Auto-Context [25], which learns the classifier and the features
jointly in an iterative algorithm, we develop a novel algorithm for joint
visual vocabulary learning and visual word weight estimation. In the
traditional approach to pattern classification, we first extract features for
the samples and then train a classifier on these features. Feature extraction
and classifier training are done independently, assuming that the feature
construction and the classifier are not related. A breakthrough idea proposed
in [25], [26] is that the features of samples can also be constructed from
previous classification results in an iterative procedure, so that the
subsequent classifier is trained using the newly constructed features. The key
point of this algorithm is to combine the classification map $P^{(t)}(i)$ and
the image patch $X(N_i)$ as the features of pixel $i$ for training the
classifier $F^{(t+1)}$. The classification map $P^{(t)}(i)$ is in turn
determined by the classifier $F^{(t)}$, while the classifier $F^{(t+1)}$ is
trained using $P^{(t)}(i)$ as part of its features. In this way, the features
and the classifier are learned jointly in an iterative way. Now consider the
bag-of-features methods: the visual word weighting is learned for fixed
bag-level features, which are in turn constructed according to the visual
vocabulary, and the vocabulary itself is learned using a clustering algorithm.
Here, like Auto-Context, we guide the clustering procedure by the learned
weighting. That is, we re-select the cluster centers from the training local
feature set according to the objective function used to learn the visual word
weights. We repeat this in an iterative algorithm until convergence. In this
way, we jointly learn a discriminative visual vocabulary and its corresponding
visual word weights, so we call the algorithm Joint-ViVo. In contrast to the
independent visual vocabulary learning methods proposed in [21], [22], [23]
and to the weighting factor estimation methods proposed in [18], [24], the
joint method we propose is neither purely a vocabulary learning algorithm nor
an algorithm that assigns weights to already learned visual words. It jointly
selects local features from the training set and assigns weights to them
iteratively. The learned visual vocabulary and its weighting should be used
together to achieve optimal performance.

In Section 2 we rigorously formulate the problem addressed in this paper by
introducing the bag-of-features framework. In Section 3 we introduce the
proposed visual word selection and weighting algorithm, Joint-ViVo. Section 4
introduces the experimental methodology and reports experimental results and
discussions. Section 5 concludes this paper with future work.

2. Bag-of-Features Based Tissue Classification Framework

Below, we present a bird's eye view of bag-of-features based tissue
classification in medical images. A block diagram of a bag-of-features based
tissue classification system is shown in Fig. 1.

Figure 1: Block diagram of a bag-of-features based tissue classification system.
• ROI Segmentation: The first stage of the system is to segment the medical
image and extract the region of interest (ROI). The obtained training ROI set
is denoted as $\mathcal{M} = \{B_n, y_n\}_{n=1}^{N}$, where $y_n$ is the class
label of the $n$-th ROI. Various ROI segmentation methods have been proposed.
Since this is not the focus of this paper, we simply apply existing
segmentation methods [7] or segment the ROIs manually [10], [16].

• Local Features: The next stage is to represent a ROI $B_n$ as a collection
of local features, such as image patches [27] (also called textons, intensity
descriptors, or small blocks) and key points with SIFT descriptors [28]. We
represent ROI $B_n$ as a bag, which contains $m_n$ local features denoted by
$B_n = \{x_i^n\}_{i=1}^{m_n}$.

• Visual Vocabulary Generation: To obtain a visual vocabulary
$\mathcal{V} = \{v_j\}_{j=1}^{V}$ of size $V$, we usually apply a clustering
algorithm to the training local feature set
$\mathcal{U} = \{x_l^{tr}\}_{l=1}^{L}$. The cluster centers $v_j$,
$j = 1, \dots, V$, are used as visual words.

• Bag-Level Representation: We apply the kernel codebook proposed in [29] to
represent an image ROI $B_n = \{x_i^n\}_{i=1}^{m_n}$ as a soft histogram over
the visual vocabulary $\mathcal{V}$, generating a $V$-dimensional frequency
vector $h_n$:

$$h_n = [h_n(1), h_n(2), \dots, h_n(V)]^T, \qquad
h_n(j) = \frac{1}{m_n} \sum_{i=1}^{m_n} K_\sigma(x_i^n - v_j) \tag{1}$$

where $K_\sigma(x_i^n - v_k) = \frac{1}{\sqrt{2\pi}\sigma}
\exp\left(-\frac{\|x_i^n - v_k\|^2}{2\sigma^2}\right)$ is a Gaussian-shaped
kernel and $\sigma$ is the smoothing parameter of the kernel $K_\sigma$.

• Visual Word Weighting: An important procedure in bag-of-features based
image retrieval and classification is to weight each bin of the histogram
vector according to its discriminative ability:

$$f_n(j) = w_j \cdot h_n(j), \qquad
f_n = [f_n(1), f_n(2), \dots, f_n(V)]^T \tag{2}$$

where $w_j$ is the weight of the $j$-th visual word.

• Classification: After each ROI in a medical image is represented as a
bag-level feature vector $f_n$, we can train a classifier to distinguish the
ROIs according to their tissue types.
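
The bag-level representation and weighting steps above can be illustrated with
a minimal sketch. It assumes local features and visual words are plain lists of
floats; the function names and the toy data are hypothetical and are not taken
from the paper.

```python
import math

def gaussian_kernel(x, v, sigma):
    """Gaussian-shaped kernel K_sigma(x - v) used in Eq. (1)."""
    sq_dist = sum((xi - vi) ** 2 for xi, vi in zip(x, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def soft_histogram(bag, vocabulary, sigma):
    """h_n(j) = (1 / m_n) * sum_i K_sigma(x_i - v_j) for every visual word v_j."""
    m = len(bag)
    return [sum(gaussian_kernel(x, v, sigma) for x in bag) / m for v in vocabulary]

def weight_histogram(hist, weights):
    """f_n(j) = w_j * h_n(j), Eq. (2)."""
    return [w * h for w, h in zip(weights, hist)]

# Toy data (hypothetical): one bag with three 2-D local features, two visual words.
bag = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]]
vocabulary = [[0.0, 0.0], [1.0, 1.0]]
h = soft_histogram(bag, vocabulary, sigma=0.5)
f = weight_histogram(h, [1.0, 0.5])
```

In this toy bag two of the three features lie next to the first visual word, so
the first bin of `h` is larger than the second.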

3. Joint-ViVo: Joint Learning and Weighting of Visual Words

In this section, we first discuss the weighting of visual words given the
visual vocabulary, using an objective function designed to distinguish triplets
of ROIs. Then we give a novel method for updating the visual words according to
the learned weighting so as to minimize the objective function. Finally, we
combine these two procedures and obtain an iterative visual word selection and
weighting algorithm, called Joint-ViVo.

3.1. Discriminative Visual Word Weighting
3.1.1. Bag-level Similarity and Distance Vector

The histogram intersection kernel
$s(h_p, h_q) = \sum_j \min(h_p(j), h_q(j))$ is commonly used to compute the
similarity between a pair of bag-level features [30]. The intersection of the
$j$-th visual word occurrence frequency between bag $B_p$ and bag $B_q$ is
denoted by $s_{pq}(j) = \min(h_p(j), h_q(j))$ in the bag-of-features model.
Accordingly, $s_{pq}$ represents the intersection vector between bag $B_p$ and
bag $B_q$:

$$s_{pq} = [s_{pq}(1), s_{pq}(2), \dots, s_{pq}(V)]^T \tag{3}$$

In the typical bag-of-features model, the similarity between two bags is the
sum of the equally weighted intersections: $s(B_p, B_q) = \sum_j s_{pq}(j)$.
In contrast, we assign different weights to the visual words, resulting in a
weighted similarity defined as
$s_w(B_p, B_q) = w^T s_{pq} = \sum_j w_j s_{pq}(j)$, which is a similarity
metric according to the definition in [18].

Instead of measuring the similarity between two bags, we can also compute the
distance between two normalized frequency histograms using the $\chi^2$
statistic [10], [31], where the $j$-th visual word $\chi^2$ distance between
bags $B_p$ and $B_q$ is
$d_{pq}(j) = \frac{1}{2} \frac{(h_p(j) - h_q(j))^2}{h_p(j) + h_q(j)}$. The
$\chi^2$ distance vector between bags $B_p$ and $B_q$ is denoted as

$$d_{pq} = [d_{pq}(1), d_{pq}(2), \dots, d_{pq}(V)]^T \tag{4}$$

In this way, we compute the $\chi^2$ distance between bags $B_p$ and $B_q$ as
$d(B_p, B_q) = \sum_j d_{pq}(j)$. Similarly, we assign different weights to the
visual words, resulting in a weighted $\chi^2$ distance defined as
$d_w(B_p, B_q) = w^T d_{pq} = \sum_j w_j d_{pq}(j)
= \frac{1}{2} \sum_j w_j \frac{(h_p(j) - h_q(j))^2}{h_p(j) + h_q(j)}$. Note
that, instead of weighting the histogram directly as $f(j) = w_j \times h(j)$
as in (2), we weight the $j$-th visual word $\chi^2$ distance $d_{pq}(j)$.
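
As a small worked example of the two bag-level measures, the sketch below
computes the intersection vector, the $\chi^2$ distance vector, and their
weighted sums for two toy histograms. The function names and numbers are
hypothetical, and treating an empty bin pair (0/0) as distance 0 is an
assumption made here, not something stated in the text.

```python
def intersection_vector(h_p, h_q):
    """s_pq(j) = min(h_p(j), h_q(j)), Eq. (3)."""
    return [min(a, b) for a, b in zip(h_p, h_q)]

def chi2_vector(h_p, h_q):
    """d_pq(j) = 0.5 * (h_p(j) - h_q(j))^2 / (h_p(j) + h_q(j)), Eq. (4)."""
    out = []
    for a, b in zip(h_p, h_q):
        denom = a + b
        # Assumption: a bin that is empty in both histograms contributes 0.
        out.append(0.5 * (a - b) ** 2 / denom if denom > 0 else 0.0)
    return out

def weighted_sum(w, vec):
    """w^T vec, used for both s_w(B_p, B_q) and d_w(B_p, B_q)."""
    return sum(wj * vj for wj, vj in zip(w, vec))

# Toy normalized histograms over V = 3 visual words (hypothetical values).
h_p = [0.5, 0.3, 0.2]
h_q = [0.2, 0.3, 0.5]
w = [1.0, 2.0, 0.0]

s_w = weighted_sum(w, intersection_vector(h_p, h_q))  # 1*0.2 + 2*0.3 + 0*0.2
d_w = weighted_sum(w, chi2_vector(h_p, h_q))          # only bin 1 contributes
```

With a zero weight on the third word, its large disagreement between the two
histograms is ignored by both measures.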

3.1.2. Large Margin Based Weights Vector w Learning 

Inspired by BoostMap [32], we learn our visual word weighting vector
$w = [w_1, w_2, \dots, w_V]^T$ by classifying triplets of objects in the
dataset. This methodology is also used in [18], [33]. Let $\mathcal{T}$ be a
triplet index set of training ROIs represented as bags of features:
$\mathcal{T} = \{(n, p, q) \mid y_n = y_p \text{ and } y_n \neq y_q\}$, where
$y_n$ denotes the class label of bag $B_n$. We aim to make the weighted
similarity between same-labeled images larger than that between differently
labeled images. Ideally, the learned weight vector $w \in \mathbb{R}_+^V$
satisfies the constraint

$$w^T s_{nq} < w^T s_{np}, \quad \forall (n, p, q) \in \mathcal{T} \tag{5}$$

When the distance measure is used, we have

$$w^T d_{np} < w^T d_{nq}, \quad \forall (n, p, q) \in \mathcal{T} \tag{6}$$

The margin of triplet $\phi = (n, p, q)$ with respect to $w$, which is
positive exactly when constraint (5) holds, is then computed as

$$\rho_\phi(w) = w^T s_{np} - w^T s_{nq} = w^T z_\phi \tag{7}$$

where

$$z_\phi = [s_{np}(1) - s_{nq}(1), \dots, s_{np}(V) - s_{nq}(V)]^T, \qquad
s_{nq}(j) = \min(h_n(j), h_q(j)) \tag{8}$$

is the original margin vector. Similarly, $z_\phi = d_{nq} - d_{np}$ can be
defined using the distance measure instead of the similarity.

After the margins are defined, the problem of learning the feature weights $w$
can be solved within the large margin framework. We perform the estimation with
a popular margin formulation, the logistic regression formulation. We also add
an $L_1$-norm penalty on $w$ to the objective function to encourage sparse
weights, which leads to the following optimization problem:

$$\min_{w} Q(w) = \sum_{\phi} \log\left(1 + \exp(-w^T z_\phi)\right)
+ \lambda \|w\|_1, \qquad \text{s.t. } w \geq 0. \tag{9}$$

For fixed $z_\phi$, (9) is a constrained convex optimization problem. Due to
the nonnegativity constraint on $w$, it cannot be solved directly by gradient
descent. To overcome this difficulty, we set $w_j = u_j^2$, $j = 1, \dots, V$,
and reformulate the problem slightly as:

$$\min_{u} O(u) = \sum_{\phi} \log\left(1 + \exp\left(-\sum_{j=1}^{V}
u_j^2 z_\phi(j)\right)\right) + \lambda \|u\|_2^2 \tag{10}$$

thus obtaining an unconstrained optimization problem. The solution $u$ can then
be readily found through gradient descent with a simple update rule:

$$u \leftarrow u - \eta \nabla O(u) \tag{11}$$

where $\nabla O(u) = \left(\lambda \mathbf{1} - \sum_{\phi}
\frac{z_\phi}{1 + \exp\left(\sum_{j=1}^{V} u_j^2 z_\phi(j)\right)}\right)
\otimes u$ (up to a constant factor of 2 that is absorbed into the learning
rate), $\otimes$ is the Hadamard product operator, and $\eta$ is the learning
rate.
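
The update rule (11) can be sketched in a few lines. This is a minimal
illustration, not the authors' implementation: it assumes the margin vectors
$z_\phi$ are oriented so that a positive $w^T z_\phi$ means the triplet is
ranked correctly, it keeps the factor 2 from the chain rule explicit, and the
function name, default hyperparameters and toy margin vectors are hypothetical.

```python
import math

def learn_weights(margin_vectors, lam=0.01, eta=0.1, n_iter=500):
    """Gradient descent on O(u) with w_j = u_j^2, so that w stays nonnegative."""
    V = len(margin_vectors[0])
    u = [1.0] * V  # assumption: start from equal weights
    for _ in range(n_iter):
        w = [uj * uj for uj in u]
        grad_w = [lam] * V  # derivative of lam * sum_j w_j
        for z in margin_vectors:
            margin = sum(wj * zj for wj, zj in zip(w, z))
            # d/d(margin) log(1 + exp(-margin)) = -1 / (1 + exp(margin)),
            # evaluated in a numerically stable way.
            if margin > 0:
                e = math.exp(-margin)
                coeff = e / (1.0 + e)
            else:
                coeff = 1.0 / (1.0 + math.exp(margin))
            for j in range(V):
                grad_w[j] -= coeff * z[j]
        # Chain rule: dO/du_j = (dO/dw_j) * 2 * u_j
        u = [uj - eta * 2.0 * gj * uj for uj, gj in zip(u, grad_w)]
    return [uj * uj for uj in u]

# Toy margin vectors (hypothetical): word 0 separates the triplets, word 1 does not.
z_list = [[0.3, -0.1], [0.2, -0.2], [0.4, 0.0]]
w = learn_weights(z_list)
```

On this toy input the weight of the helpful word grows while the weight of the
harmful word shrinks toward zero, which is the sparsifying behaviour the $L_1$
penalty is meant to encourage.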

3.2. Visual Words Selection 

After obtaining the visual word weighting vector $w$ for the bag-level
histogram features, we can validate the selected visual words and update them
accordingly. The visual word selection and the clustering of the training local
features are done alternately. Given the initial visual words
$\mathcal{V} = \{v_j\}_{j=1}^{V}$, each feature $x_l^{tr}$ in $\mathcal{U}$ is
assigned to one of $V$ clusters $\{C_j\}_{j=1}^{V}$ with $\{v_j\}_{j=1}^{V}$
as centroids:

$$C_j = \left\{x_l^{tr} \,\middle|\, v_j = \arg\min_{v_k \in \mathcal{V}}
\|v_k - x_l^{tr}\|^2\right\} \tag{12}$$

We can now construct the bag-level features using the selected visual words in
$\mathcal{V}$. We denote the bag-level features (histograms) as functions of
the visual words $v_1, \dots, v_V$:

$$h_n(\mathcal{V}) = [h_n(1|\mathcal{V}), \dots, h_n(V|\mathcal{V})]^T \tag{13}$$

where

$$h_n(j|\mathcal{V}) = h_n(j|v_1, \dots, v_V)
= \frac{1}{m_n} \sum_{i=1}^{m_n} K_\sigma(x_i^n - v_j) \tag{14}$$

Obviously, the $j$-th bin of the histogram is a function of the $j$-th visual
word only, rather than of the entire visual vocabulary.

Up to now, each bag can be represented as a histogram of the selected visual
words. With these visual words we can also compute the margin vector for each
triplet of bags $\phi = (n, p, q)$ as $z_\phi(v_1, \dots, v_V)$ using (8) or
its distance-based counterpart.

We now define the optimisation problem underlying the iterative framework in
terms of the visual words $v_1, \dots, v_V$ and the weighting vector $w$ as
follows:

$$\min_{w, v_1, \dots, v_V} Q(w, v_1, \dots, v_V)
= \sum_{\phi} \log\left(1 + \exp(-w^T z_\phi(v_1, \dots, v_V))\right)
+ \lambda \|w\|_1, \qquad
\text{s.t. } w \geq 0, \; v_j \in C_j, \; j = 1, \dots, V. \tag{15}$$

where $Q(w, v_1, \dots, v_V)$ is the same loss function as defined in (9), and
$z_\phi(v_1, \dots, v_V)$ specifies the bag-level feature margin vector for the
$\phi$-th bag triplet given the selected visual words $v_1, \dots, v_V$. Given
the learned visual word weighting vector $w$, we can further update the visual
word of each cluster $C_j$. This is equivalent to minimizing
$Q(w, v_1, \dots, v_V)$ in (15) with respect to $v_1, \dots, v_V$. We adopt a
procedure reminiscent of coordinate descent to update $v_j$ for each cluster as
follows:

$$v_j^{(t+1)} = \arg\min_{v_j \in C_j^{(t)}} Q\left(w^{(t)}, v_1^{(t+1)},
\dots, v_{j-1}^{(t+1)}, v_j, v_{j+1}^{(t)}, \dots, v_V^{(t)}\right),
\qquad j = 1, \dots, V \tag{16}$$

where $v_j^{(t)}$ and $v_j^{(t+1)}$ correspond to the old and the updated
visual word of the $j$-th cluster, respectively, and $w$ is fixed to $w^{(t)}$
for the $t$-th iteration. Since $w$ is unchanged during the update of
$\mathcal{V}$, only the $z_\phi(v_1, \dots, v_V)$ term of (15), i.e., the
margin term of the logistic regression objective function, is considered here.
Each $v_j$ is updated while fixing all other visual words.

After the new visual words are selected by updating
$\mathcal{V}^{(t+1)} = \{v_j^{(t+1)}\}_{j=1}^{V}$, we can re-cluster the
training local features $\mathcal{U}$ by assigning each local feature
$x_l^{tr}$ to its nearest visual word as in (12), and the updated clusters
$\{C_j^{(t+1)}\}_{j=1}^{V}$ are obtained. Then, using the newly selected visual
words, each bag $B_n$ can be represented as a frequency vector
$h_n(\mathcal{V}^{(t+1)})$ and the weighting vector $w$ can be updated by (11).
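
One sweep of the visual word selection step can be sketched as follows. This
is an illustration under stated assumptions rather than the authors' code: the
search over each cluster is exhaustive, margins use the histogram intersection
with the sign convention "positive means correctly ranked", the constant
$\lambda \|w\|_1$ term is dropped because $w$ is fixed, and all function names
and toy data are hypothetical.

```python
import math

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign_clusters(features, words):
    """Eq. (12): indices of the training features nearest to each visual word."""
    clusters = [[] for _ in words]
    for idx, x in enumerate(features):
        nearest = min(range(len(words)), key=lambda k: sq_dist(x, words[k]))
        clusters[nearest].append(idx)
    return clusters

def bin_value(bag, word, sigma):
    """Eq. (14): the j-th soft-histogram bin depends on the j-th word only."""
    norm = math.sqrt(2.0 * math.pi) * sigma * len(bag)
    return sum(math.exp(-sq_dist(x, word) / (2.0 * sigma ** 2)) for x in bag) / norm

def histograms(bags, words, sigma):
    return [[bin_value(bag, v, sigma) for v in words] for bag in bags]

def triplet_loss(w, hists, triplets):
    """Data term of Eq. (15) with intersection margins s_np - s_nq."""
    total = 0.0
    for n, p, q in triplets:
        margin = sum(wj * (min(a, b) - min(a, c))
                     for wj, a, b, c in zip(w, hists[n], hists[p], hists[q]))
        total += math.log1p(math.exp(-margin))
    return total

def select_words(features, bags, triplets, words, w, sigma):
    """Re-pick each v_j from its own cluster C_j while all other words stay fixed."""
    words = [list(v) for v in words]
    clusters = assign_clusters(features, words)
    hists = histograms(bags, words, sigma)
    for j, members in enumerate(clusters):
        best_word = words[j]
        best_col = [h[j] for h in hists]
        best_loss = triplet_loss(w, hists, triplets)
        for idx in members:
            # Only column j of the histograms changes when v_j changes.
            col = [bin_value(bag, features[idx], sigma) for bag in bags]
            for h, c in zip(hists, col):
                h[j] = c
            cand_loss = triplet_loss(w, hists, triplets)
            if cand_loss < best_loss:
                best_word, best_col, best_loss = list(features[idx]), col, cand_loss
        words[j] = best_word
        for h, c in zip(hists, best_col):
            h[j] = c
    return words, assign_clusters(features, words)

# Toy data (hypothetical): bags 0 and 1 share a class, bag 2 belongs to another.
bags = [[[0.0, 0.1], [0.2, 0.0]], [[0.1, 0.0], [0.0, 0.2]], [[1.0, 1.0], [0.9, 1.1]]]
features = [x for bag in bags for x in bag]
triplets = [(0, 1, 2), (1, 0, 2)]
old_words = [[0.3, 0.3], [1.2, 1.2]]
w = [1.0, 1.0]
new_words, clusters = select_words(features, bags, triplets, old_words, w, sigma=0.5)
loss_before = triplet_loss(w, histograms(bags, old_words, 0.5), triplets)
loss_after = triplet_loss(w, histograms(bags, new_words, 0.5), triplets)
```

Because each word is replaced only when a candidate strictly lowers the loss,
a sweep can never increase the objective for fixed weights.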

3.3. Joint Visual Words Selection and Weighting: Concluding Results

We summarize the traditional independent and our joint visual vocabulary and
weight learning algorithms in Fig. 2. As we can see from Fig. 2 (a), in
independent learning the visual vocabulary $\mathcal{V}$ is first learned using
a clustering algorithm, and then each bag is represented as a histogram over
$\mathcal{V}$. After this, the bag-level features are weighted, resulting in
the visual word weighting vector $w$. Our Joint-ViVo in Fig. 2 (b) is
basically similar to independent learning: it uses a visual vocabulary to build
histograms as bag-level features and weights the features with the visual word
weights. However, unlike the independent learning framework, Joint-ViVo learns
the weights and re-selects the visual words from the training local feature set
alternately in an iterative procedure. Moreover, the re-clustering of the whole
training set $\mathcal{U}$ is done according to the newly selected visual words
$\mathcal{V}$, which play the role of centroids, while the selection of the
visual words $\mathcal{V}$ is based on the previously learned weight vector
$w$. At the same time, the learning of the weight vector $w$ is based on the
bag-level features, which are the histograms of a ROI over the selected visual
words $\mathcal{V}$. This enables the iterative algorithm shown in Fig. 2 (b).
The resulting Joint-ViVo algorithm is given in Algorithm 1.

Algorithm 1 Joint Learning and Weighting of Visual Words Algorithm: Joint-ViVo

Require: Local feature training set $\mathcal{U}$;
Require: Training bag triplet set $\mathcal{T}$;
Require: Initial visual vocabulary $\mathcal{V}^{(0)}$;
Require: Initial visual word weighting vector $w^{(0)}$;
Require: Stop criterion $\theta$.

Cluster the local features in $\mathcal{U}$ to the initial visual words in
$\mathcal{V}^{(0)}$ and obtain the initial clusters $\{C_j^{(0)}\}_{j=1}^{V}$
as in (12).
for $t = 1, \dots, T$ do
    Represent each bag $B_n$ in each bag triplet as a bag-level feature
    $h_n = h_n(\mathcal{V}^{(t-1)})$ based on the previously selected visual
    vocabulary using (13) and (14);
    Compute the original margin vector $z_\phi$ for each triplet
    $\phi = (n, p, q)$ using (8);
    Update the visual word weighting vector $w^{(t)}$ by updating $u^{(t)}$
    as in (11);
    if $\|w^{(t)} - w^{(t-1)}\|^2 < \theta$ then
        Break,
    else
        Update the visual words $\{v_j^{(t)}\}_{j=1}^{V}$ as in (16);
        Re-cluster the training local feature set and obtain
        $\{C_j^{(t)}\}_{j=1}^{V}$ using (12).
    end if
end for

Output: Visual words $\mathcal{V}^{(t)}$ and weighting vector $w^{(t)}$.
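
The control flow of Algorithm 1 can be sketched independently of how the two
inner steps are implemented. In the sketch below the weight-learning and
word-selection steps are passed in as callables; their signatures, the default
values, and the toy stand-ins used to exercise the loop are hypothetical and
only illustrate the alternation and the stopping test.

```python
def joint_vivo(words, w, learn_weights, select_words, theta=1e-6, max_iter=50):
    """Alternate weight learning and visual word selection until w stabilises."""
    iterations = 0
    for iterations in range(1, max_iter + 1):
        w_new = learn_weights(words, w)      # weight update, cf. Eq. (11)
        change = sum((a - b) ** 2 for a, b in zip(w_new, w))
        w = w_new
        if change < theta:                   # ||w(t) - w(t-1)||^2 < theta
            break
        words = select_words(words, w)       # word update and re-clustering
    return words, w, iterations

# Toy stand-ins (hypothetical) that only exercise the control flow.
def toy_learn_weights(words, w):
    return [0.5 * wj + 0.5 for wj in w]      # contracts toward w = 1

def toy_select_words(words, w):
    return words                             # leaves the vocabulary unchanged

words, w, n_iter = joint_vivo([[0.0]], [0.0], toy_learn_weights, toy_select_words)
```

With these stand-ins the squared weight change shrinks by a factor of 4 per
iteration, so the loop stops once it drops below `theta`.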
4. Experiments

In this section, we present the application of Joint-ViVo to three tissue
classification tasks:

1. Classifying Breast Tissue Density in Mammograms

2. Classifying Lung Tissue in HRCT images [10].

3. Identifying Brain Tissue Type in MRI images [16].

In these three tasks, the system uses a nearly identical parameter setting,
including the number of visual words and the stopping criterion.

4.1. Experiment I: Classifying Breast Tissue Density in Mammogram

4.1.1. Mammogram Dataset and Setup

In this group of experiments, we test our method on classifying breast
parenchymal tissue in mammograms using the bag-of-features model. The
experiments are carried out on a public and widely known database, the
Mammographic Image Analysis Society (MIAS) database. This database is composed
of the Medio-Lateral Oblique views of both breasts of 161 women (322
mammograms). The MIAS database provides annotations for each mammogram, one of
which refers to the breast density. The images are labelled as follows (see
Fig. 3):

1. Fatty (106 images): the breast is almost entirely fatty,

2. Glandular (104 images): the breast contains some fibroglandular tissue,
or

3. Dense (112 images): the breast is extremely dense.

(a) Fatty (b) Glandular (c) Dense

Figure 3: Three MIAS images, one from each MIAS category.

Given the set of training images, local descriptors are computed around the
pixels of the tissue and a visual vocabulary $\mathcal{V}$ is obtained. We
choose image patches [31] as the local features for bag-of-features based
classification of breast tissue density in mammograms. An $N \times N$ square
neighborhood is opened around each pixel, and its pixels are reordered row by
row to form a vector in an $N^2$-dimensional feature space. The patches are
spaced by $M$ pixels on a regular grid over the area of the tissue. The visual
vocabulary $\mathcal{V}$ with its weighting vector $w$ is learned using our
Joint-ViVo algorithm. We use k-means as a baseline method to learn the visual
vocabulary $\mathcal{V}$, and equal weighting as a baseline weighting vector
$w$. We also compare our Joint-ViVo algorithm to a state-of-the-art vocabulary
learning algorithm, InfoLoss [23], and a state-of-the-art weighting algorithm,
Boosted Weighting [18]. After each ROI in a mammogram is represented as a
bag-level feature using $\mathcal{V}$ and $w$, we perform the mammogram
classification using SVM [37] and kNN [38].
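
The patch-based local features described above can be sketched as follows,
assuming the image and the tissue mask are nested lists, `n` is odd, and
patches that would cross the image border are skipped. The border rule, the
function name and the toy image are assumptions made for illustration.

```python
def extract_patches(image, mask, n, step):
    """Flattened n x n patches centred on a regular grid inside the tissue mask."""
    half = n // 2
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(half, rows - half, step):
        for c in range(half, cols - half, step):
            if not mask[r][c]:
                continue  # grid point lies outside the tissue area
            # Row-by-row reordering of the neighbourhood into an n*n vector.
            patch = [image[rr][cc]
                     for rr in range(r - half, r + half + 1)
                     for cc in range(c - half, c + half + 1)]
            patches.append(patch)
    return patches

# Toy 7 x 7 image (hypothetical) with pixel value r * 7 + c and a full mask.
image = [[r * 7 + c for c in range(7)] for r in range(7)]
mask = [[True] * 7 for _ in range(7)]
patches = extract_patches(image, mask, n=3, step=2)
```

Here `n` plays the role of the texton size $N$ and `step` the role of the grid
spacing $M$.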

To evaluate the results, we use a leave-one-out [39] method, in which each
sample is analyzed by a classifier trained on all other samples. When working
with the MIAS dataset, however, we leave out the two images (left and right
breast) of the same woman together. Therefore, for the MIAS database we use 320
images for training and 2 for testing, repeated 161 times with different test
and training images each time. For the SVM, a Gaussian kernel is used, and the
multi-class classification is done using the one-versus-all rule. The overall
performance rate is measured by the average value of the diagonal entries of
the confusion table.
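
The evaluation protocol can be sketched as below. The grouping by woman follows
the description above; reading "average of the diagonal entries" as the mean of
the per-class recall values of a row-normalised confusion table is an
assumption, and the function names and toy labels are hypothetical.

```python
def leave_one_group_out(groups):
    """Yield (train_idx, test_idx) with all samples of one group held out together."""
    for g in sorted(set(groups)):
        test = [i for i, gi in enumerate(groups) if gi == g]
        train = [i for i, gi in enumerate(groups) if gi != g]
        yield train, test

def mean_diagonal_rate(y_true, y_pred, labels):
    """Average of the diagonal of the row-normalised confusion table."""
    rates = []
    for lab in labels:
        idx = [i for i, y in enumerate(y_true) if y == lab]
        if idx:
            rates.append(sum(1 for i in idx if y_pred[i] == lab) / len(idx))
    return sum(rates) / len(rates)

# Toy example (hypothetical): three women with two images each.
groups = [0, 0, 1, 1, 2, 2]
splits = list(leave_one_group_out(groups))
y_true = ["F", "F", "G", "G", "D", "D"]
y_pred = ["F", "G", "G", "G", "D", "D"]
rate = mean_diagonal_rate(y_true, y_pred, ["F", "G", "D"])
```

Each split holds out both images of one woman, mirroring the 320/2 partition
used for MIAS.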
4.1.2. Results

For evaluation, a performance curve for each visual vocabulary learning or
weighting method is plotted in Fig. 4, showing the classification rate versus
$N$, the size of the texton, while fixing $V = 1600$ and $M = 2$. The other
parameters of Joint-ViVo are selected by 10-fold cross validation on the
training data set. From the results, we make the following observations:

1. Among InfoLoss, Boosted Weighting, their combination and Joint-ViVo, the
proposed Joint-ViVo achieves the best performance in terms of average
classification rate over all 6 values of $N$. In Fig. 4 (a), using SVM,
Joint-ViVo improves on InfoLoss by 5.28% and on Boosted Weighting by 6.83%.
Furthermore, the combination of InfoLoss and Boosted Weighting improves on
InfoLoss by 2.17% and on Boosted Weighting by 4.96%. Of all 6 texton sizes,
the proposed Joint-ViVo performs best at $N = 11$. On the remaining texton
sizes, its performance deteriorates only slightly compared to the other
algorithms. Similar behaviour can be observed in Fig. 4 (b). We also note that
the classification results of SVM are much better than those of kNN, which is
not surprising.

2. From the results, it is clear that supervised vocabulary learning
algorithms, especially Joint-ViVo and InfoLoss, outperform unsupervised
k-means. Among the supervised vocabulary based methods, both the proposed
Joint-ViVo and InfoLoss demonstrate excellent accuracy. The results also show
that, on average, Boosted Weighting works better than equal weighting, but it
can be further improved by using an effective vocabulary.

3. According to the results, Joint-ViVo performs better than all independent
vocabulary learning and weighting methods, as well as their combination. This
is because, when the visual vocabulary and its weighting vector are learned
jointly, the discriminative information is captured by maximizing the margins
of ROI triplets.
(a) SVM

Figure 4: Performance on the MIAS dataset for different values of the parameter N.

We also compare the results obtained on this database with those obtained by
Blot and Zwiggelaar [41], Oliver et al. [42] and Bosch et al. [7]. The results
are shown in Table 1.
Table 1: Comparison summary of the proposed method with other works that classify
parenchymal density on the MIAS database.

Reference     Performance (%)
[41]          50
[42]          73
[7]           91.39
Joint-ViVo    97.52



We did not use all possible ROI triplets for weighting factor learning on the
MIAS and DDSM data sets, as the training data size is large and it takes hours
to learn a visual vocabulary and weighting vector. Nevertheless, compared to
state-of-the-art methods, the classification result obtained with Joint-ViVo is
already quite competitive, as only the bag-of-features based method [7] has
similar performance. We attribute this result to the discriminative property of
the local features in the data set, which Joint-ViVo exploits effectively. In
all three groups of experiments, classification with Joint-ViVo under the
different annotations performs better than the other classifiers. These results
suggest the effectiveness of Joint-ViVo compared with representative methods
for discriminative tissue classification.

4.2. Experiment II: Lung Tissue Classification in Diffuse Lung Disease
4.2.1. HRCT axial images dataset and experiment setup

In the second experiment, we classify HRCT axial images with normal,
emphysema, ground glass, honeycombing, and reticular patterns collected from a
hospital. A radiologist marked several ROIs by drawing boundaries that enclosed
abnormal patterns on each image. The resulting database consists of 174
normal, 209 honeycombing, 346 emphysema, 189 reticular and 198 ground glass
patterns.

For local feature extraction, the intensity and SIFT local features are
calculated in 15 x 15 and 12 x 12 regions, respectively. The local features are
sampled uniformly by sliding the local region with a step of M = 1, 3, 5
pixels. Joint-ViVo is then employed to learn the vocabulary $\mathcal{V}$ and
the weights $w$ from the local feature set extracted from the training images.
The visual vocabulary size $V$ is set to 1500. A ROI is then characterized by a
histogram of quantized local features, which are sampled from the ROI and
quantized using the visual vocabulary. The weighted histogram of the intensity
or SIFT local features is used as a 1500-dimensional vector that is classified
by the classifier. We use the SVM [37] as the classifier to assign the
histogram features to 1 of the 4 diffuse lung diseases. We use the $\chi^2$
kernel as the kernel function, which is an extended Gaussian kernel with the
$w$-weighted $\chi^2$ distance:

$$k(h_p, h_q) = \exp\left(-\frac{1}{2} \sum_j w_j
\frac{(h_p(j) - h_q(j))^2}{h_p(j) + h_q(j)}\right) \tag{17}$$

For the multi-class classification, several SVM models are built using
one-versus-one combinations, and classification is done by voting among these
SVM models. The SVM parameters are optimized by 2-fold cross validation.
Classification performance is also evaluated by 2-fold cross validation.
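
The weighted $\chi^2$ kernel can be sketched as below. The sketch assumes the
kernel is the exponential of the negative weighted $\chi^2$ distance with the
factor 1/2 used for $d_{pq}(j)$ earlier, and that bins empty in both histograms
contribute nothing; the function names and toy histograms are hypothetical.

```python
import math

def weighted_chi2_kernel(h_p, h_q, w):
    """k(h_p, h_q) = exp(-sum_j w_j * 0.5 * (h_p(j) - h_q(j))^2 / (h_p(j) + h_q(j)))."""
    dist = 0.0
    for wj, a, b in zip(w, h_p, h_q):
        if a + b > 0:  # assumption: bins empty in both histograms are skipped
            dist += wj * 0.5 * (a - b) ** 2 / (a + b)
    return math.exp(-dist)

def gram_matrix(hists, w):
    """Kernel values for every pair of bag-level histograms."""
    return [[weighted_chi2_kernel(a, b, w) for b in hists] for a in hists]

# Toy histograms (hypothetical): the first and third bags are identical.
hists = [[0.5, 0.5], [0.2, 0.8], [0.5, 0.5]]
K = gram_matrix(hists, [1.0, 1.0])
```

A Gram matrix of this form is what an SVM implementation that accepts a
precomputed kernel would take as input, for instance scikit-learn's `SVC` with
`kernel="precomputed"`.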

4.2.2. Results 

Fig. 5 shows the overall classification accuracy of the intensity and SIFT
features under varying feature sampling conditions. The accuracy of each method
is the ratio of the number of correctly classified ROIs to the total number of
ROIs, averaged over 2 trials for each method.

Figure 5: Overall classification accuracy of the intensity feature and the SIFT feature with 
different visual vocabulary learning methods. (a) Intensity. (b) SIFT. Compared methods: 
Joint-ViVo, InfoLoss + Boosted Weighting, InfoLoss, Boosted Weighting, k-means. 

In the figure, it is clear that the k-means learned vocabulary with the SIFT 
local feature gives the worst results. This is because k-means simply finds 
the vocabulary words as the clustering centroids of the local feature set, 
and this set is susceptible to noise and carries little discriminative 
information. The k-means learned vocabulary with the intensity local feature 
performs better due to the intensity's robustness to noise. Moreover, again, 
Joint-ViVo outperforms all other visual vocabulary learning and weighting 
strategies (including their combination). The Joint-ViVo method effectively 
rejects noise and outliers by selecting the most reliable visual words. The 
Joint-ViVo learned vocabulary gives the best performance in all cases. The 
overall recognition performance of Joint-ViVo appears to be very competitive. 
This is strong evidence that the inherent relationship between the vocabulary 
and its weights, though not yet clearly understood, can be explored and 
exploited by learning them jointly. Moreover, the accuracy of both features 
increases with the number of sampling points determined by M. This means that 
a sufficient number of local features is necessary to calculate a smooth 
distribution of the features. 
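The dependence of the number of local features on the sampling step M can be illustrated with a short sketch. The 64 x 64 ROI size is a hypothetical value chosen for the example; only the 15 x 15 window and the steps 1, 3 and 5 come from the experimental setup.

```python
def patch_origins(height, width, patch, step):
    """Top-left corners of all patch x patch windows on a regular grid."""
    if patch > height or patch > width:
        return []
    return [(r, c)
            for r in range(0, height - patch + 1, step)
            for c in range(0, width - patch + 1, step)]

# Hypothetical 64 x 64 ROI; the 15 x 15 window matches the intensity feature.
counts = {step: len(patch_origins(64, 64, 15, step)) for step in (1, 3, 5)}
for step, n in counts.items():
    print(f"step M = {step}: {n} local features")
```

For this hypothetical ROI the counts are 2500, 289 and 100 local features for steps of 1, 3 and 5 pixels, so a histogram built with the largest step rests on far fewer samples.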

4.3. Experiment III: Identifying Brain Tissue Type in MRI 
4.3.1. T1 and T2 MRI Dataset 

In the third experiment we focus on classification of individual regions of 
interest in a dataset of 24 post-T1 weighted gadolinium-enhanced MRI slices 
and 96 pre-T2 weighted MRI slices of the brain of a single mouse. Images were 
registered prior to segmentation and normalized following combination. 
Segmentation itself was performed by a domain expert and supported by histology 
data. In particular, we are interested in discriminating between 21 T1 and 75 
T2 manually segmented ROIs from these images, representing various types of 
tissue: cerebrospinal fluid (CSF), gray matter, tissue necrosis, hippocampus 
tissue, and samples from three distinct regions of tumor with varying degrees of 
vascularization, neoplastic growth, and tissue necrosis. We wish to discriminate 
between individual tissue types as well as collectively classify tissue as normal 
(CSF, hippocampus, and gray matter classes) or abnormal (necrosis, tumor1, 
tumor2, tumor3). The post-T1 ROIs were assigned labels "CSF", "Tumor1", 
"Tumor2", and "Tumor3", representing areas of cerebrospinal fluid, homogeneous 
"typical" tumor tissue, heavily vascularized tumor tissue, and tumor tissue near 
an area of necrosis and edema. To take advantage of the imaging properties of 
T2 relaxation, we selected ROIs from the T2 image dataset in the following 
classes: "CSF", "Graymatter", "Hippocampus", and "Necrosis". These labels 
corresponded to areas of cerebrospinal fluid, normal gray matter tissue, a 
region of normal tissue located in the hippocampus, and a region of liquefactive 
necrosis near the lower central region of the tumor, respectively. 

For local feature representation, given a fixed block size, each image is 
decomposed into a number of small blocks (image patches). Based on such small 
blocks from different images, a visual vocabulary V containing visual words (key 
blocks) is generated using different algorithms, including Joint-ViVo, InfoLoss 
[23] and the Generalized Lloyd Algorithm (GLA) [43]. The visual word weights w 
are learned by Joint-ViVo and, for comparison, by Boosted Weighting [13]. The 
visual word frequency vector (or weighted visual word frequency vector when 
visual word weighting is applied) is used as a representative feature vector of 
the image texture in classification. We then employ the Histogram Model, which 
has been shown to be effective for texture classification, as a similarity 
measure in k-nearest neighbor (kNN) classification, which determines the class 
of a ROI by a majority vote of its k nearest neighbors for a user-specified k. 
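The majority-vote rule can be sketched as below. The similarity used by the Histogram Model is not specified in this section, so weighted histogram intersection is swapped in as a stand-in; the function names and the toy data in the usage lines are illustrative assumptions.

```python
from collections import Counter

def histogram_intersection(h_a, h_b, w=None):
    """Stand-in similarity: weighted sum of bin-wise minima of two histograms."""
    if w is None:
        w = [1.0] * len(h_a)
    return sum(w_v * min(a, b) for a, b, w_v in zip(h_a, h_b, w))

def knn_predict(query, train_hists, train_labels, k, w=None):
    """Label of `query` by majority vote among its k most similar histograms."""
    if not 1 <= k <= len(train_hists):
        raise ValueError("k must be between 1 and the training set size")
    ranked = sorted(range(len(train_hists)),
                    key=lambda i: histogram_intersection(query, train_hists[i], w),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy word-frequency vectors over a two-word vocabulary (illustrative only).
train = [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]
labels = ["CSF", "CSF", "Necrosis"]
predicted = knn_predict([0.75, 0.25], train, labels, k=3)
```

Passing a weight vector `w` turns the plain frequency vectors into weighted ones, which is where learned visual word weights would enter this classifier.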

4.3.2. Results 

We performed leave-one-out classification experiments [3^] on a combined 
dataset of 21 post-T1 weighted (gadolinium enhanced) ROIs and 75 pre-T2 
weighted ROIs extracted from 24 post-T1 and 96 pre-T2 slices of the brain of a 
single mouse afflicted with a large intracranial neoplasm. 
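The leave-one-out protocol can be sketched generically as follows. The `predict` callable is a placeholder for any classifier, and the 1-nearest-neighbor helper with squared Euclidean distance is an assumption used only to exercise the loop, not the similarity used in the experiments.

```python
def leave_one_out_accuracy(samples, labels, predict):
    """Hold out each sample once, predict it from the rest, return accuracy.

    predict(query, train_samples, train_labels) is any classifier callable.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    correct = 0
    for i, (query, truth) in enumerate(zip(samples, labels)):
        rest = samples[:i] + samples[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if predict(query, rest, rest_labels) == truth:
            correct += 1
    return correct / len(samples)

def nearest_neighbor(query, train, train_labels):
    """1-NN on squared Euclidean distance, used only to exercise the loop."""
    best = min(range(len(train)),
               key=lambda i: sum((q - t) ** 2 for q, t in zip(query, train[i])))
    return train_labels[best]

# Toy data: two well-separated classes (illustrative only).
accuracy = leave_one_out_accuracy([[0.0], [0.1], [1.0], [1.1]],
                                  ["a", "a", "b", "b"], nearest_neighbor)
```

With 96 ROIs, each round of this protocol trains on 95 ROIs and tests on the remaining one.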

Average kNN accuracies on the combined and individual T1 and T2 datasets 
for values of k ranging from 1 to 6 are shown in Fig. 6. These experiments 
are shown for different databases, and they clearly and consistently illustrate 
that Joint-ViVo outperforms GLA, InfoLoss and Boosted Weighting for almost all 
neighborhood sizes k and test sets, with only a few iterations (t < 50 in 
practice). The superior performance of our visual vocabulary V comes 
essentially from the joint learning of the visual vocabulary V and the visual 
word weighting vector w; in almost all cases, 50 iterations were sufficient to 
improve the performance of V, and a few more iterations (T = 80) were needed 
for the other cases. On the one hand, this corroborates the fact that a 
supervised learned vocabulary and weights provide state-of-the-art performance, 
and on the other hand, it shows that their performance can be consistently 
improved by jointly learning the visual words and their weights. 

5. Conclusion 

In this paper, we have explored the use of bag-of-features for tissue 
classification problems in medical imaging. We commenced by reviewing some of 
the properties of the visual vocabulary and its relationship with the visual 
words' weighting. This analysis relied on the joint learning of the visual 
vocabulary and the weighting factors, just like the joint learning of features 
and classifiers in Auto-Context [26]. We proposed a new approach that learns 
the visual vocabulary and its weights through an iterative procedure formed 
from generative and discriminative approaches. The main idea is to introduce a 
discriminative model with a visual word selection criterion to select the most 
discriminative visual words in each iteration. Two of the most important 
properties of Joint-ViVo are that the weighting factors are learned based on 
the vocabulary and that the vocabulary is selected based on the learned 
weighting factors. 
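The alternation between weight learning and word selection can be illustrated with a heavily simplified sketch. Everything specific here is an assumption made for illustration: the triplet hinge loss with unit margin, the weighted squared-difference distance, the projected subgradient update, the learning rate, and the fraction of words kept per iteration. It is not the objective or optimizer of Joint-ViVo itself.

```python
def sq_diff(a, b, words):
    """Per-word squared differences between two histograms."""
    return {v: (a[v] - b[v]) ** 2 for v in words}

def learn_weights(hists, triplets, words, epochs=50, lr=0.1):
    """Projected subgradient descent on a triplet hinge loss (assumed form)."""
    w = {v: 1.0 for v in words}
    for _ in range(epochs):
        for i, j, k in triplets:  # i anchor, j same class, k other class
            pos = sq_diff(hists[i], hists[j], words)
            neg = sq_diff(hists[i], hists[k], words)
            margin = sum(w[v] * (neg[v] - pos[v]) for v in words)
            if margin < 1.0:  # triplet violates the unit margin
                for v in words:
                    w[v] = max(0.0, w[v] + lr * (neg[v] - pos[v]))
    return w

def joint_vivo_sketch(hists, triplets, n_words, keep=0.5, iterations=3):
    """Alternate weight learning and selection of the highest-weight words."""
    words = list(range(n_words))
    w = {v: 1.0 for v in words}
    for _ in range(iterations):
        w = learn_weights(hists, triplets, words)
        n_keep = max(1, int(len(words) * keep))
        words = sorted(words, key=lambda v: w[v], reverse=True)[:n_keep]
    return words, {v: w[v] for v in words}

# Toy histograms: word 0 separates the two classes, the others do not.
toy_hists = [[1.0, 0.5, 0.2, 0.3], [0.9, 0.1, 0.6, 0.3],
             [0.0, 0.5, 0.2, 0.3], [0.1, 0.1, 0.6, 0.3]]
toy_triplets = [(0, 1, 2), (1, 0, 3), (2, 3, 0), (3, 2, 1)]
selected, weights = joint_vivo_sketch(toy_hists, toy_triplets, n_words=4)
```

On this toy input the loop retains only word 0, the single word whose frequency differs between the two classes, and raises its weight above the initial value.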

We have explored three tissue classification applications of Joint-ViVo. 
The first of these is brain tissue type identification in MRI, the second is 
breast tissue density classification in mammograms, and the third is lung 
tissue classification in HRCT axial images. In our experiments, we employed 
three medical image data sets for tissue classification problems and confirmed 
that the joint learning of the visual vocabulary and weighting factors greatly 
improved the classification performance. One further application, the 
recognition of natural scene categories, was also explored. We compared our 
Joint-ViVo approach with state-of-the-art visual vocabulary learning and visual 
word weighting approaches. Our approach greatly outperformed both of these 
approaches, whose classification performance was comparable to each other. The 
newly developed Joint-ViVo algorithm was shown to outperform the alternatives 
in its ability to learn and weight the visual words for the bag-of-features 
method. Moreover, the Joint-ViVo algorithm can also be applied to 
bag-of-features based bioinformatics 

Figure 6: Average kNN classification accuracies for the T1 (a), T2 (b) and combined datasets 
(c). 



